INTRODUCTION



The National Assessment of Educational Progress (NAEP) is
a census-like assessment project which collects data on a
national probability sample. NAEP has collected data in ten
different curriculum areas from four different age classes
(9-year-olds, 13-year-olds, 17-year-olds, and young adults
26-35 years old). The major purpose of NAEP is to measure
changes across time in performance on objectives-referenced
exercises. Many of the exercises NAEP uses in its assessment
process are open-ended and must be hand scored. NAEP's
hand scoring does not generally consist of assigning responses
to points on an ordinal scale. Instead, responses
are almost always assigned to nominal (descriptive) categories.
These nominal categories are, however, classifiable
as acceptable or unacceptable.

Because NAEP's objective is to measure changes over
time in performance, it is important that hand scoring not
depend heavily on the scorer or the time of the scoring. It
is known that when essays are scored for quality on ordinal
scales, the scores vary with the context in which the papers
are read (Coffman, 1971). If these findings are true also
for NAEP's data, measurement of change would require that
all responses from all points in time be read in the same
context and time. If NAEP can show that its semi-professional
scoring is consistent across time and scorers, then
perhaps change can be measured on these exercises without
all responses being re-read each time a change measure is
made.



METHODS 

The study was designed to answer several questions: 

1. To what extent does the score a response receives 
depend upon the scorer who scores it? 



2. To what extent does the score depend on the time
(within the two-three month scoring session) when
the response is scored?

Thanks are due to Janet Bailey for her careful computations.

For this study sample responses were selected from the
actual response data from the Writing and Career and Occupational
Development (COD) assessments done in 1973-1974.
Three exercises from COD and two exercises from Writing were
selected at ages 9, 13, and 17. Three exercises from COD
were selected for adults. (See Table 1.) For each exercise
one sample response was selected arbitrarily from each of
28 administration units spread throughout the country. Each
sample response was assigned a number (1-28) and photocopies
of the samples were made.



Table 1.

         Content    Exercise                    Number of
Age      Area       Number     (NAEP Number)    Parts Analyzed

9        Writing     1         (0-201002)       5
                     2         (0-201012)       [illegible]
         COD         3         (2-301034)       [illegible]
                     4         (2-302015)       6
                     5         (2-402002)       3

13       Writing     6         (0-201018)       3
                     7         (0-202007)       1
         COD         8         (2-102025)       3
                     9         (2-302015)       6
                    10         (2-306012)       3

17       Writing    11         (0-201018)       3
                    12         (0-301008)       4
         COD        13         (2-102025)       3
                    14         (2-306006)       5
                    15         (2-306012)       3

Adult    COD        16         (2-302005)       10
                    17         (2-306009)       12
                    18         (2-306012)       3

The readers were members of the professional scoring
department at Measurement Research Center in Iowa City, IA.
All of these scorers had at least bachelor's degrees; most
had teaching experience. The experimental papers were
scored as a part of the normal COD and Writing scoring,
which involved approximately 100,000 student responses per
age. All scorers were trained together on the use of up to
40 different scoring guides before the scoring for each age
began. The scoring guides consist of a descriptive title
for each category, illustrated by up to twenty sample
responses.

After scoring normal assessment responses for two to
three weeks, each scorer was given, in random order, sets of
photocopies of each sample response. Each scorer independently
read each response and recorded the score on a
separate sheet. The sets of sample responses were then
collected and given new random orders. The sets were presented
again to scorers for rescoring when about one half
of all of the data for an age class had been scored and
again immediately after the scoring for the age class was
completed. The scoring for each age took from two to three
months to complete.



ANALYSES 

Introduction 



The major analysis problem was that most exercises had
nominal scoring categories. Many conventional summary
statistics, however, require ordinal data. We finally defined
several percent-of-agreement summaries that made sense to us
and were based on the raw data.

The first answers the question:

(1) What is the probability that all of the scorers
or all scorers but one will agree on a random
paper?

A second type of percent of agreement focuses not on the
agreement among scorers, but on the agreement within
scorers over the three scoring times. It answers the
question:

(2) What is the probability that a random scorer will
assign the same category at least twice, or all
three times?*



*For age 9 the sample responses were only scored twice, since
the entire scoring session took only six weeks.
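
As an illustration only (it is not part of the original study), the two percent-of-agreement summaries defined above could be computed along the lines of the following Python sketch. The nested-list layout `scores[time][scorer][response]`, the function names, and the toy category labels are assumptions made for this example. The sketch also assumes that agreement among scorers is computed within a single scoring time, which the text does not state.

```python
from collections import Counter

def among_scorer_agreement(scores_one_time):
    """Summary (1): agreement among scorers at one scoring time.

    scores_one_time[scorer][response] holds a nominal category label.
    Returns (fraction of responses on which all scorers agree,
             fraction on which all scorers, or all but one, agree).
    """
    n_scorers = len(scores_one_time)
    n_responses = len(scores_one_time[0])
    all_agree = 0
    all_but_one = 0
    for r in range(n_responses):
        labels = [scorer[r] for scorer in scores_one_time]
        top = Counter(labels).most_common(1)[0][1]  # size of largest group
        if top == n_scorers:
            all_agree += 1
        if top >= n_scorers - 1:
            all_but_one += 1
    return all_agree / n_responses, all_but_one / n_responses

def within_scorer_agreement(scores):
    """Summary (2): agreement within scorers over the scoring times.

    scores[time][scorer][response] holds a nominal category label.
    Returns (fraction of scorer-response pairs given the same category
             at least twice, fraction given it at every time).
    """
    n_times = len(scores)
    n_scorers = len(scores[0])
    n_responses = len(scores[0][0])
    at_least_twice = 0
    every_time = 0
    for s in range(n_scorers):
        for r in range(n_responses):
            labels = [scores[t][s][r] for t in range(n_times)]
            top = Counter(labels).most_common(1)[0][1]
            if top >= 2:
                at_least_twice += 1
            if top == n_times:
                every_time += 1
    total = n_scorers * n_responses
    return at_least_twice / total, every_time / total

# Hypothetical toy data: 3 times x 3 scorers x 4 responses.
example_scores = [
    [["A", "B", "A", "C"], ["A", "B", "B", "D"], ["A", "B", "A", "E"]],
    [["A", "B", "A", "C"], ["A", "B", "A", "D"], ["A", "C", "A", "G"]],
    [["A", "B", "B", "C"], ["A", "B", "B", "D"], ["A", "B", "A", "F"]],
]
```

When only two scoring times exist, as at age 9, "at least twice" and "every time" coincide in this sketch.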



Further analysis required transformation of the data to
an ordinal scale. Since the scorers are believed to be competent
judges, the score assigned by most scorers to a given
response was assumed to be the true score for that response.
The most common score variable was defined as presence or
absence of this true score. (The most common score variable
is denoted by MCS.)

For the MCS (most common score) variable, one further
percentage was computed. It was a more general percent of
agreement than the percent of agreement on responses (1) or
the percent of agreement over time (2), defined above. This
overall percent of agreement answers the question:

(3) What is the probability that a random scorer, on
a random response, at any one of three times,
will assign the true category?

All three percents of agreement are discussed below and
presented in Attachments 1-4.
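
A minimal Python sketch of the MCS variable and the overall percent of agreement (3) is given here for illustration. It is not the study's own computation. It assumes the layout `scores[time][scorer][response]`, pools all scorers and all times when finding the most common score for a response, and breaks ties in favor of the category met first; the text specifies none of these details.

```python
from collections import Counter

def most_common_scores(scores):
    """Return, for each response, the category assigned most often.

    scores[time][scorer][response] holds a nominal category label.
    Assumption: counts are pooled over all scorers and all times.
    """
    n_responses = len(scores[0][0])
    modal = []
    for r in range(n_responses):
        labels = [scorer[r] for time in scores for scorer in time]
        modal.append(Counter(labels).most_common(1)[0][0])
    return modal

def mcs_indicator(scores):
    """Recode every score as 1 (matches the most common score) or 0."""
    modal = most_common_scores(scores)
    return [
        [[1 if label == modal[r] else 0 for r, label in enumerate(scorer)]
         for scorer in time]
        for time in scores
    ]

def overall_agreement(scores):
    """Summary (3): share of all scorings that match the most common score."""
    flags = [f for time in mcs_indicator(scores) for scorer in time for f in scorer]
    return sum(flags) / len(flags)

# Hypothetical toy data: 3 times x 3 scorers x 2 responses.
mcs_example = [
    [["A", "B"], ["A", "B"], ["A", "C"]],
    [["A", "B"], ["B", "B"], ["A", "C"]],
    [["A", "B"], ["A", "C"], ["A", "B"]],
]
```

The 0/1 indicator returned by `mcs_indicator` is the kind of variable that can then be entered into an analysis of variance.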

The MCS variable was also used to compute a repeated-measures
analysis of variance, with responses and scorers
as random factors, and time and (where applicable) exercise
parts as fixed factors. These Responses x Scorers x Times
x Parts analyses were meant to answer the questions:

Do different scorers vary in their ability to assign
true scores?

Does time affect scorers' ability to assign true
scores?

Does the exercise part* affect that ability?

Does the specific response affect that ability?

*Exercises have parts for several reasons. Parts may be
two aspects of the same task (such as scores for spelling
and punctuation) or they may be two attempts at the same
task (as when respondents are asked to give two reasons
for something).

A second ordinal variable was created by collapsing the
data into acceptable and unacceptable categories. The A/U
(acceptable/unacceptable) variable was also used in the
Respondents x Scorers x Times x Parts analysis of variance
design, but the questions to be answered differed. One
would expect both respondent and part scores to vary in
acceptability, that is, to vary in difficulty. The questions
to be answered by this analysis are:

Do scorers differ in their assignment of acceptable
scores?

Does time of scoring affect the assignment of acceptable
scores?

Or, in other words, do either scorers or time affect
the difficulty of an open-ended measure?

The analyses of variance for both MCS and A/U are presented
in Attachments 5-8 below.
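
The collapse to the A/U variable, and the descriptive question of whether scorers or times shift an exercise's apparent difficulty, can be sketched in Python as follows. This is an illustrative sketch only: the layout `scores[time][scorer][response]`, the set of acceptable categories, and the function names are hypothetical, and the simple proportions shown stand in for, and do not replace, the analysis of variance the study actually used.

```python
def to_acceptable(scores, acceptable):
    """Recode nominal categories as 1 (acceptable) or 0 (unacceptable).

    scores[time][scorer][response] holds a category label;
    `acceptable` is the set of labels classified as acceptable.
    """
    return [
        [[1 if label in acceptable else 0 for label in scorer] for scorer in time]
        for time in scores
    ]

def proportion_acceptable_by_scorer(flags):
    """Proportion of acceptable scores each scorer assigned, over all times."""
    n_scorers = len(flags[0])
    result = []
    for s in range(n_scorers):
        values = [f for time in flags for f in time[s]]
        result.append(sum(values) / len(values))
    return result

def proportion_acceptable_by_time(flags):
    """Proportion of acceptable scores assigned at each time, over all scorers."""
    result = []
    for time in flags:
        values = [f for scorer in time for f in scorer]
        result.append(sum(values) / len(values))
    return result

# Hypothetical toy data: 2 times x 2 scorers x 2 responses,
# with categories "A" and "B" treated as acceptable.
au_example = to_acceptable(
    [
        [["A", "B"], ["A", "C"]],
        [["A", "B"], ["B", "C"]],
    ],
    acceptable={"A", "B"},
)
```

Large differences among the per-scorer or per-time proportions would suggest that the apparent difficulty of the exercise depends on who scored it or when.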

One final analysis of the A/U variable was made, based
on analysis of variance data. This analysis is related to
the intra-class correlation or Cronbach's alpha. It
differs, however, in that it is based on a multi-factor
design. It involves estimating the generalizability of
the scoring from a ratio of relevant components of
variance.* (For a general discussion of the technique, see
Stanley, 1971).

Specifically, the among-respondents component of variance
is taken as an estimate of variance in the population
of the ability to perform, or not perform, the exercise.
That is, it is taken as an estimate of the variance of the
population true score. That variance component is divided
by the sum of the variance components judged to be relevant
to the actual (as opposed to the experimental) scoring
situation. The resulting ratio can be interpreted as a
reliability estimate: a ratio of true variance to total
variance.

Those components of variance determined to be relevant
were:

(1) among persons, the estimate of true variance.

(2) among times: since a normal hand scoring takes
two-three months to complete.**

(3) among scorers: since NAEP data are based on sums
of items scored by different scorers.

(4) all interactions of the above factors.

*Expected Mean Squares were constructed by the BMD 08V
analysis of variance program (Dixon, 1973).
**See Glass and Hakstian (1968) for a critical discussion of
including fixed effects in an analysis of this type. We
felt justified, since our selection of times would result
in a maximum variance due to time, and thus create a
conservative estimate of reliability.

Note that the variance due to parts was omitted. Variance
among parts might be construed as part of the true variance.
It was nevertheless not included, because parts were
a fixed factor (see the Glass and Hakstian note above) and
thus might seriously bias the ratio. These variance ratios
are presented in Attachments 1-4.

RESULTS AND DISCUSSION

Percent Agreement

The data were initially analyzed by calculating, for
each exercise, the percentage of the sample responses on
which all scorers agreed upon the category assignments.
These percentages showed considerable variation across
exercises, ranging in value from 58.4% to 93.8%, with the mean
percentage being 76.5%. The overall mean for agreement of
all but one scorer was 86.3%.

Next the categories assigned to the responses by each
reader for all readings were compared. The average percentage
of reader agreement was calculated on each exercise.
These percentages varied from 82.6% to 98.7%, with a mean of
90.4%.

The overall agreement, based on presence or absence 
of most common score, ranged from 88.8% to 99.5%, with an 
average of 94.1%. 

All the percentages seem to be high enough to indicate
that NAEP hand scoring is not heavily dependent upon the
scorer. The percentages are displayed in Attachments 1-4.

There is a slight advantage for COD over Writing
exercises on all three indices. Since the average advantage
across ages 9, 13 and 17 is no greater than 5% on any index,
we conclude that the scoring is essentially equivalent for
both subjects. Indeed, since the COD exercises studied
cover topics appropriate to mathematics and citizenship as
well as career education, we are tempted to generalize to
all NAEP scoring (except art, literature, and music).



Analysis of Variance - MCS

If NAEP's scoring procedure were perfectly generalizable
over scorers, times, respondents, and different
exercise parts, there would be no variation at all in
assignment of the most common score. We would, therefore,
prefer to find no significant analysis of variance effects.
The worst possible result would be to find large effects
for scorers or times. That would mean that the baseline
NAEP data would have to be rescored every time NAEP wanted
to measure change or a state or local assessment wanted to
compare its results with the NAEP baseline. Even if the
expense of such re-scoring were tolerable, it would be
extremely inelegant for NAEP's baseline, criterion results to
change for every different comparison.

Inspection of Attachments 5-8, "Probability Levels
Associated with the F-Ratios for the Analysis of Variance,"
for the MCS results shows two strong and consistent effects.
These are the effects for responses and for the responses
by exercise parts interaction. It is not surprising,
though regrettable, that the consistency of the scoring
depends on how people respond and what they are responding
to.

There do not appear to be any consistent effects for
times or for scorers, but the scorers by times interactions
appear more often than one would like. To evaluate the
importance of these effects, components of variance were
estimated and, from these, the percentage of total variance
was calculated for each effect. This analysis showed
that approximately 17% of the variance (over all exercises)
could be attributed to responses and parts combined, and
less than 1% could be attributed to scorers and times
combined. Thus the effects, even for responses and parts, are
small, though statistically stable.

Analysis of Variance - A/U 

In contrast to the MCS analysis, one would expect large
variations among responses and parts for the acceptable/
unacceptable variable. In fact, the variance among responses
is, as mentioned above, an estimate of the variance among
true scores in the sample. However, as with the MCS, variance
among scorers or times is a strong blow at the generalizability
of the scoring procedure.

The results of the A/U analyses of variance are summarized in
Attachments 5-8. Again, the only strong and consistent effects
are for responses and the responses by parts interaction.
These two effects combined account for over 73% of the variance
across all exercises. In contrast, the two effects account
for only 17% of the variance in the MCS variable, above.
Thus, for the A/U variable, the effect of responses and
parts is both statistically stable and large.

The effects for scorers and times are negligible.

Components of Variance Estimates of Generalizability 

Components of variance for the A/U variable were also
used to compute an estimate of the ratio of true variance
to total variance in the ability to perform the exercise
acceptably. The results are displayed in column (4) of
Attachments 1-4. Note that this coefficient is affected
by the lack of variance among responses on very easy (or
very difficult) exercises. In particular, exercises 9 and
18, which were answered correctly by 99% of respondents,
have a coefficient of less than .35 simply because there
was almost no variation among responses. Exercises 8, 13,
and 15 also were answered correctly by more than 90% of the
respondents.

The median proportion of variance accounted for was
.80. This is a conservative estimate of the generalizability
of NAEP scoring, since it is based on single exercises
answered by only 28 respondents; since variations due to
scorers and times are included in the error variance; and
since 5 of the 18 exercises included were extremely easy.
While the question always remains of how reliable is reliable
enough, the present investigators were extremely pleased
with a median coefficient of .80. We feel that good evidence
now exists that NAEP scoring procedures will generalize to
other times, for change measures, and other users, for local